White Wine Quality Exploration

This report explores a dataset containing quality ratings and chemical attributes for 4898 white wines. I will analyze how the quality of white wines is affected by the other attributes, and explore whether there are relationships among those attributes.

Univariate Plots Section

## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Our dataset consists of 13 variables, with 4898 observations.

First, I want to see the distribution of white wine quality.

I convert the numeric variable ‘quality’ to a factor variable.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

From this, we can see that most wines have quality 5, 6 or 7, and that no wine is rated 0, 1, 2 or 10.
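
As a minimal sketch of the conversion, the quality column can be rebuilt from the counts printed above and turned into a factor. The names `counts`, `quality_num` and `share_mid` are illustrative, and using an ordered factor is an assumption; the report does not show the original call.

```r
# Rebuild the quality scores from the counts table printed above.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)
quality_num <- rep(as.integer(names(counts)), times = counts)

# Convert the numeric scores to a factor (ordered here, as an assumption).
quality <- factor(quality_num, levels = 3:9, ordered = TRUE)
quality_table <- table(quality)

# Share of wines rated 5, 6 or 7.
share_mid <- sum(quality_table[c("5", "6", "7")]) / length(quality)

# Caveat: as.numeric() on a factor returns the level codes (1..7),
# not the original scores (3..9). Go through as.character() to get the scores.
codes  <- as.numeric(quality)
scores <- as.numeric(as.character(quality))
```

The last two lines matter whenever the factor is converted back to a number, for example inside a model formula: the level codes are shifted by 2 relative to the scores.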

Next, I am curious about how the different ingredients in white wines affect quality.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

This plot shows the distribution of fixed.acidity for white wines. In the histogram, most wines have fixed acidity between 5.8 g/dm^3 and 7.8 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

This plot shows the distribution of residual.sugar for white wines. On the log10 scale the histogram is slightly bimodal: most wines have residual.sugar around 1.5 g/dm^3 or between 8 g/dm^3 and 12 g/dm^3. From the summary, we find there are some outliers.
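
A small sketch of binning on the log10 scale, using only the first ten residual.sugar values shown in the str() output above. Base R `hist()` is used here as an assumption, since the report does not show its plotting code; the break points are chosen to cover the full range in the summary (0.6 to 65.8).

```r
# First ten residual.sugar values from the str() output (g/dm^3).
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5)

# Bin on the log10 scale; plot = FALSE returns the bins without drawing.
log_sugar <- log10(residual_sugar)
h <- hist(log_sugar, breaks = seq(-0.25, 2, by = 0.25), plot = FALSE)

# Translate the bin edges back to the original scale for reading.
edges <- 10^h$breaks
data.frame(from  = round(head(edges, -1), 2),
           to    = round(tail(edges, -1), 2),
           count = h$counts)
```

Even in these ten rows the values fall into separate groups (near 1.5, near 7 to 8.5, and 20.7), which is the kind of separation a log scale makes easier to see.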

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

This plot shows the distribution of chlorides for white wines. In the histogram, most wines have chlorides between 0.03 g/dm^3 and 0.06 g/dm^3. From the summary, we find there are some outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

This plot shows the distribution of total.sulfur.dioxide for white wines. In the histogram, most wines have total.sulfur.dioxide between 100 mg/dm^3 and 160 mg/dm^3, and the distribution is close to normal. From the summary, we find there are a few outliers. The mean is 138.4 and the median is 134.0.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

This plot shows the distribution of the pH of white wines. In the histogram, most wines have a pH between 3.0 and 3.3, and the distribution is close to normal.

From the summary, the mean is 3.188 and the median is 3.180.

Then, we will analyze the density distribution of white wines.

summary(wq$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

From this summary, we can calculate IQR = 0.9961 - 0.9917 = 0.0044. So the upper fence is 0.9961 + 1.5 × IQR = 1.0027 and the lower fence is 0.9917 - 1.5 × IQR = 0.9851.
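
The fence arithmetic can be reproduced from the quartiles printed in the summary. This is a sketch with illustrative variable names; on the full data frame the same quartiles would come from `quantile(wq$density, c(0.25, 0.75))`.

```r
# Quartiles and extremes of density, copied from the summary above.
q1 <- 0.9917
q3 <- 0.9961
dens_min <- 0.9871
dens_max <- 1.0390

# Tukey fences: 1.5 times the IQR beyond each quartile.
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Is there at least one value beyond each fence?
has_low_outliers  <- dens_min < lower_fence
has_high_outliers <- dens_max > upper_fence

round(c(iqr = iqr, lower = lower_fence, upper = upper_fence), 4)
```

With these numbers the minimum stays inside the lower fence while the maximum lies above the upper fence, so the outliers are on the high side only.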

The histogram shows that the density distribution is almost normal. From the summary and the fences above, we find there are some outliers for density on the high side.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

This plot shows the distribution of the alcohol percentage of white wines. In the histogram, alcohol percentage is spread more evenly across its range than the other attributes. Even though the median and mean of alcohol are close, it is not a normal distribution.

Next, I am curious about the sweetness level of white wines, so I will create a new variable ‘sweetness’ for further analysis.

##     X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1   1           7.0             0.27        0.36           20.7     0.045
## 2   2           6.3             0.30        0.34            1.6     0.049
## 3   3           8.1             0.28        0.40            6.9     0.050
## 4   4           7.2             0.23        0.32            8.5     0.058
## 5   5           7.2             0.23        0.32            8.5     0.058
## 6   6           8.1             0.28        0.40            6.9     0.050
## 7   7           6.2             0.32        0.16            7.0     0.045
## 8   8           7.0             0.27        0.36           20.7     0.045
## 9   9           6.3             0.30        0.34            1.6     0.049
## 10 10           8.1             0.22        0.43            1.5     0.044
##    free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                   45                  170  1.0010 3.00      0.45     8.8
## 2                   14                  132  0.9940 3.30      0.49     9.5
## 3                   30                   97  0.9951 3.26      0.44    10.1
## 4                   47                  186  0.9956 3.19      0.40     9.9
## 5                   47                  186  0.9956 3.19      0.40     9.9
## 6                   30                   97  0.9951 3.26      0.44    10.1
## 7                   30                  136  0.9949 3.18      0.47     9.6
## 8                   45                  170  1.0010 3.00      0.45     8.8
## 9                   14                  132  0.9940 3.30      0.49     9.5
## 10                  28                  129  0.9938 3.22      0.45    11.0
##    quality  sweetness
## 1        6     medium
## 2        6        dry
## 3        6 medium dry
## 4        6 medium dry
## 5        6 medium dry
## 6        6 medium dry
## 7        6 medium dry
## 8        6     medium
## 9        6        dry
## 10       6        dry
##        dry medium dry     medium      sweet 
##       2410       1662        825          1

From this bar chart, we can see the distribution of sweetness levels of white wines. Most white wines are dry or medium dry; together these account for 83% of all white wines in this dataset.
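
A sketch of how such a variable can be derived from residual.sugar with `cut()`. The cut points 4, 12 and 45 g/dm^3 are an assumption: they reproduce the labels of the ten rows printed above, but the report does not state the exact thresholds that were used.

```r
# residual.sugar for the ten rows printed above (g/dm^3).
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5)

# Assumed thresholds: <= 4 dry, (4, 12] medium dry, (12, 45] medium, > 45 sweet.
sweetness <- cut(residual_sugar,
                 breaks = c(-Inf, 4, 12, 45, Inf),
                 labels = c("dry", "medium dry", "medium", "sweet"))
as.character(sweetness)

# Share of dry and medium dry wines, from the counts printed above.
sweet_counts <- c("dry" = 2410, "medium dry" = 1662, "medium" = 825, "sweet" = 1)
share_dry_or_medium_dry <-
  sum(sweet_counts[c("dry", "medium dry")]) / sum(sweet_counts)
round(100 * share_dry_or_medium_dry, 1)
```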

Univariate Analysis

What is the structure of your dataset?

There are 4898 observations in this dataset. Apart from the row index X, the variables are the 12 original attributes (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”) plus the “sweetness” variable I created.

I convert the variable quality into a factor; all other original variables are numeric.

Quality is scored from 0 (worst) to 10 (best).

Other observations:

  1. Most white wines have quality 6.

  2. A lot of wines have fixed acidity between 5.8 g/dm^3 and 7.8 g/dm^3.

  3. The ingredients residual.sugar and chlorides both have a long tail and some outliers. Many white wines have residual.sugar around 1.5 g/dm^3.

  4. The distributions of total.sulfur.dioxide and pH are close to normal.

  5. Alcohol percentage is spread more evenly than the other attributes and is not normally distributed.

  6. Most white wines are dry or medium dry; together these account for 83% of all white wines in this dataset.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this dataset is how different ingredients affect white wine quality. The purpose of this project is to analyze quality in relation to the following attributes: fixed.acidity, residual.sugar, chlorides, total.sulfur.dioxide, pH and alcohol.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The sweetness feature will also help us investigate the quality of white wines, because the sweetness level affects the taste and so may influence how the experts grade the wines.

Did you create any new variables from existing variables in the dataset?

I created a variable sweetness which represents the sweet level of white wines. There are four levels for sweetness: dry, medium dry, medium, sweet.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

For the quality variable, the range is from 3 to 9. No wine is rated 0, 1, 2 or 10. I think this is because the grades are given by experts and are therefore quite subjective; they do not give very low or perfect grades to white wines.

Alcohol percentage is spread more evenly than the other attributes. Even though the median and mean of alcohol are close, it is not a normal distribution.

For the sweetness levels, only one wine falls in the sweet category in this dataset. Maybe this is because white wines are not supposed to be very sweet.

Bivariate Plots Section

First, I want to explore the correlation coefficients between the variables.

##                      fixed.acidity residual.sugar    chloride
## fixed.acidity           1.00000000     0.08902070  0.02308564
## residual.sugar          0.08902070     1.00000000  0.08868454
## chloride                0.02308564     0.08868454  1.00000000
## total.sulfur.dioxide    0.09106976     0.40143931  0.19891030
## density                 0.26533101     0.83896645  0.25721132
## pH                     -0.42585829    -0.19413345 -0.09043946
## alcohol                -0.12088112    -0.45063122 -0.36018871
## quality                -0.11366283    -0.09757683 -0.20993441
## sweetness               0.08462378     0.93375334  0.09117180
##                      total.sulfur.dioxide     density           pH
## fixed.acidity                 0.091069756  0.26533101 -0.425858291
## residual.sugar                0.401439311  0.83896645 -0.194133454
## chloride                      0.198910300  0.25721132 -0.090439456
## total.sulfur.dioxide          1.000000000  0.52988132  0.002320972
## density                       0.529881324  1.00000000 -0.093591493
## pH                            0.002320972 -0.09359149  1.000000000
## alcohol                      -0.448892102 -0.78013762  0.121432099
## quality                      -0.174737218 -0.30712331  0.099427246
## sweetness                     0.407802136  0.78928904 -0.186400065
##                         alcohol     quality   sweetness
## fixed.acidity        -0.1208811 -0.11366283  0.08462378
## residual.sugar       -0.4506312 -0.09757683  0.93375334
## chloride             -0.3601887 -0.20993441  0.09117180
## total.sulfur.dioxide -0.4488921 -0.17473722  0.40780214
## density              -0.7801376 -0.30712331  0.78928904
## pH                    0.1214321  0.09942725 -0.18640006
## alcohol               1.0000000  0.43557472 -0.46255482
## quality               0.4355747  1.00000000 -0.10204075
## sweetness            -0.4625548 -0.10204075  1.00000000

To see the correlation coefficients between all variables, I create a data frame M and plot M as a correlation matrix.
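
A minimal sketch of building such a matrix with `cor()`, using three columns from the ten rows printed earlier. The data frame name `wq10` is illustrative, and with only ten rows the coefficients will differ from the full-data values above.

```r
# Three columns from the ten rows printed earlier in the report.
wq10 <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956,
                     0.9951, 0.9949, 1.0010, 0.9940, 0.9938),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0)
)

# Pearson correlation matrix (the default method of cor()).
M <- cor(wq10)
round(M, 2)
```

Note that a column with no variation, such as quality in these ten rows (all 6), would give NA correlations, so it is left out of this small example.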

I create this plot filled by quality to see the quality distribution in fixed.acidity. Most quality levels are distributed roughly normally in fixed.acidity.

From these two plots, we can see that quality 3 has the widest interquartile range in fixed.acidity, almost twice that of the others, while quality 9 has a very narrow one, only about half that of the others.

## wq$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.200   6.575   7.300   7.600   8.525  11.800 
## -------------------------------------------------------- 
## wq$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.800   6.400   6.900   7.129   7.600  10.200 
## -------------------------------------------------------- 
## wq$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.500   6.400   6.800   6.934   7.400  10.300 
## -------------------------------------------------------- 
## wq$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.838   7.300  14.200 
## -------------------------------------------------------- 
## wq$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.200   6.200   6.700   6.735   7.200   9.200 
## -------------------------------------------------------- 
## wq$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.900   6.200   6.800   6.657   7.300   8.200 
## -------------------------------------------------------- 
## wq$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.60    6.90    7.10    7.42    7.40    9.10

From this table, we can confirm what we found in the two plots. Most quality levels have a similar spread, while quality 3 has the widest interquartile range (8.525 - 6.575 = 1.95) and quality 9 the narrowest (7.40 - 6.90 = 0.50).
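
The spread comparison can be checked from the quartiles in the table above. This is a sketch with illustrative names; on the full data the same values would come from `tapply(wq$fixed.acidity, wq$quality, IQR)`.

```r
# First and third quartiles of fixed.acidity per quality level, from the table.
fa_q1 <- c("3" = 6.575, "4" = 6.4, "5" = 6.4, "6" = 6.3,
           "7" = 6.2, "8" = 6.2, "9" = 6.9)
fa_q3 <- c("3" = 8.525, "4" = 7.6, "5" = 7.4, "6" = 7.3,
           "7" = 7.2, "8" = 7.3, "9" = 7.4)

# Interquartile range per quality level.
fa_iqr <- fa_q3 - fa_q1
round(fa_iqr, 3)

# Quality levels with the widest and the narrowest middle 50%.
widest   <- names(which.max(fa_iqr))
narrowest <- names(which.min(fa_iqr))
```

The minimum-to-maximum range tells a different story (quality 6 spans 3.8 to 14.2), so the comparison here is about the interquartile range, which is what the box in a boxplot shows.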

This plot displays the quality distribution in residual.sugar. The distribution for each quality level is somewhat right-skewed.

This plot displays the quality distribution in chlorides. Most quality levels are distributed roughly normally in log10 of chlorides.

This plot shows that the distribution of every quality level in total.sulfur.dioxide is almost normal.

This stacked density plot by quality shows the quality distribution in density; the distribution is roughly normal for each quality level.

From this boxplot of density for each quality level, we can see that the highest quality probably has the lowest density. Quality 9 has a lower density than any other level.

## wq$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
## -------------------------------------------------------- 
## wq$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
## -------------------------------------------------------- 
## wq$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
## -------------------------------------------------------- 
## wq$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## wq$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## -------------------------------------------------------- 
## wq$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## -------------------------------------------------------- 
## wq$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9896  0.9898  0.9903  0.9915  0.9906  0.9970

From this table, we may speculate that better quality white wines have relatively lower density.

This plot shows that the distribution of every quality level in pH is roughly normal.

This plot again displays the quality distribution, this time in alcohol, but it differs a little from the other attributes. Most quality levels are distributed roughly normally for alcohol between 8% and 12%, while between 12% and 14% almost all white wines have quality above 5.

## wq$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.34   11.00   12.60 
## -------------------------------------------------------- 
## wq$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## wq$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## wq$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## wq$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## wq$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## wq$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

From the boxplot and statistics above, we may infer that high quality white wines usually have a high alcohol percentage.
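
As a rough check of this trend, and of the density trend noted earlier, a rank correlation can be computed between the quality level and the group medians from the two tables above. This is only a sketch on seven group medians; the groups differ greatly in size (20 wines at quality 3, 5 at quality 9), so it is not a substitute for a correlation on the raw data.

```r
# Quality levels and the per-level medians copied from the tables above.
quality_level  <- 3:9
median_alcohol <- c(10.45, 10.10, 9.50, 10.50, 11.40, 12.00, 12.50)
median_density <- c(0.9944, 0.9941, 0.9953, 0.9937, 0.9918, 0.9916, 0.9903)

# Spearman rank correlation: does the median rise or fall with quality?
rho_alcohol <- cor(quality_level, median_alcohol, method = "spearman")
rho_density <- cor(quality_level, median_density, method = "spearman")

round(c(alcohol = rho_alcohol, density = rho_density), 2)
```

The medians are not perfectly monotone (quality 5 has the lowest median alcohol and the highest median density), which is why a rank-based measure is used here.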

The histogram of alcohol percentage for each quality level shows more clearly that most quality levels are distributed roughly normally between 8% and 12%. However, the levels above 5 also have a noticeable share between 12% and 14%, probably because good quality white wines have a higher alcohol percentage.

## wq$quality: 3
##        dry medium dry     medium      sweet 
##         11          6          3          0 
## -------------------------------------------------------- 
## wq$quality: 4
##        dry medium dry     medium      sweet 
##        104         47         12          0 
## -------------------------------------------------------- 
## wq$quality: 5
##        dry medium dry     medium      sweet 
##        578        565        314          0 
## -------------------------------------------------------- 
## wq$quality: 6
##        dry medium dry     medium      sweet 
##       1063        766        368          1 
## -------------------------------------------------------- 
## wq$quality: 7
##        dry medium dry     medium      sweet 
##        551        225        104          0 
## -------------------------------------------------------- 
## wq$quality: 8
##        dry medium dry     medium      sweet 
##         99         52         24          0 
## -------------------------------------------------------- 
## wq$quality: 9
##        dry medium dry     medium      sweet 
##          4          1          0          0

From the bar chart and statistics above, for every quality level the number of white wines decreases from dry to sweet. We may conclude that most white wines are dry or medium dry.
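
The counts above can be put into a contingency table to compare quality levels of very different sizes. This sketch copies the printed counts into a matrix; the names `sweet_by_quality` and `row_share` are illustrative. On the full data, `table(wq$quality, wq$sweetness)` would give the same table.

```r
# Counts per quality level (rows) and sweetness level (columns), from above.
sweet_by_quality <- matrix(
  c(  11,   6,   3, 0,
     104,  47,  12, 0,
     578, 565, 314, 0,
    1063, 766, 368, 1,
     551, 225, 104, 0,
      99,  52,  24, 0,
       4,   1,   0, 0),
  nrow = 7, byrow = TRUE,
  dimnames = list(quality   = as.character(3:9),
                  sweetness = c("dry", "medium dry", "medium", "sweet"))
)

# Share of each sweetness level within a quality level (rows sum to 1).
row_share <- prop.table(sweet_by_quality, margin = 1)
round(100 * row_share, 1)
```

Row shares make it easier to see whether the mix of sweetness levels changes with quality, rather than only that the counts fall from dry to sweet.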

I create this scatterplot between residual.sugar and density with a linear smoother. We can see that residual.sugar has a strong positive linear correlation with density: white wines with higher residual.sugar tend to have higher density.

I create this scatterplot between alcohol and density with a linear smoother. We can see that alcohol has a strong negative linear correlation with density: white wines with a higher alcohol percentage tend to have lower density.

These two plots both show that the density of white wines rises as the sweetness level increases. This agrees with the relationship between density and residual.sugar analyzed above.

I create this scatterplot between total.sulfur.dioxide and density with a linear smoother. We can see that total.sulfur.dioxide has a moderate linear correlation with density, so the density of white wines tends to be somewhat higher when total.sulfur.dioxide is higher.
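
The link between a linear smoother and the correlation coefficient can be sketched with a least-squares fit on the ten rows printed earlier. It is an assumption that the smoother in these plots is a least-squares line; the plotting code is not shown in the report.

```r
# residual.sugar and density for the ten rows printed earlier.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5)
density        <- c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956,
                    0.9951, 0.9949, 1.0010, 0.9940, 0.9938)

# Least-squares line: density as a linear function of residual.sugar.
fit   <- lm(density ~ residual_sugar)
slope <- unname(coef(fit)["residual_sugar"])

# For simple linear regression, slope = r * sd(y) / sd(x).
r            <- cor(residual_sugar, density)
slope_from_r <- r * sd(density) / sd(residual_sugar)

# To draw it: plot(residual_sugar, density); abline(fit)
c(slope = slope, slope_from_r = slope_from_r, r = r)
```

The sign of the slope always matches the sign of the correlation, which is why the upward and downward lines in the scatterplots agree with the positive and negative coefficients in the matrix.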

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  1. The quality of white wines is not affected by these attributes that much.

  2. The correlation coefficients between quality and these attributes are mostly under 0.3 in absolute value. Only the coefficient of quality and density (-0.31) and that of quality and alcohol (0.44) exceed it. This indicates that quality may be only weakly related to density and alcohol.

  3. The density of white wines has a linear relationship with some other attributes. It has a strong linear correlation with residual.sugar and alcohol: density goes up when residual.sugar goes up, and goes down when alcohol percentage goes up. Density has a moderate relationship with total.sulfur.dioxide.

  4. Since the number of white wines at every quality level decreases from dry to sweet, we may infer that most white wines are dry or medium dry.
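
To put the size of these coefficients in perspective, squaring a correlation gives the share of variance in quality that a straight-line relationship with that single attribute would account for. The sketch below copies the quality column of the correlation matrix above; the vector names are illustrative.

```r
# Correlation of each attribute with quality, from the matrix above.
r_quality <- c(fixed.acidity        = -0.11366283,
               residual.sugar       = -0.09757683,
               chlorides            = -0.20993441,
               total.sulfur.dioxide = -0.17473722,
               density              = -0.30712331,
               pH                   =  0.09942725,
               alcohol              =  0.43557472,
               sweetness            = -0.10204075)

# Squared correlation, as a percentage of variance, largest first.
r_squared <- r_quality^2
round(100 * sort(r_squared, decreasing = TRUE), 1)

# Attributes whose correlation with quality reaches 0.3 in absolute value.
strongest <- names(r_quality)[abs(r_quality) >= 0.3]
```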

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Although the quality of white wines is not affected much by these attributes, the density of white wines has a linear correlation with residual.sugar, alcohol and total.sulfur.dioxide.

The density of white wines goes up when residual.sugar goes up; while the density goes down when alcohol percentage goes up.

What was the strongest relationship you found?

The strongest relationship is between residual.sugar and sweetness. That is because I derived the sweetness level directly from residual.sugar.

Multivariate Plots Section

In this scatterplot colored by quality, we can see that density increases gradually as residual.sugar goes up.

At the same value of residual.sugar, the colors for quality 8 and 9 are mostly at the bottom while the colors for low quality are above them. That may be because higher quality white wines have relatively lower density.

Moreover, from the plot we can see that many points of each quality level are at the left side of the x axis. This suggests that most white wines have low residual.sugar no matter what quality they are.

This scatterplot shows that density and total.sulfur.dioxide have a moderate linear correlation. For each quality level, most of the points are around the center of the x axis, with fewer and fewer toward the two sides. This indicates that every quality level is distributed roughly normally in total.sulfur.dioxide.

In this scatterplot colored by quality, we can see that density decreases gradually as the alcohol percentage goes up.

Along the x axis, when the alcohol percentage is more than 12, there are very few wines of quality 3 or quality 4.

Together, these may indicate that better quality white wines have a higher alcohol percentage and lower density.

This plot displays more visually what we explored before about density, sweetness and quality. The density of white wines increases as the sweetness level goes up. At the same time, the number of wines at each quality level decreases from dry to sweet, which means white wines of every quality are usually dry or medium dry. Most of the blue and purple points, which represent quality 7 and 8, are on the left side of the x axis with lower density.

This plot shows that density and residual.sugar do have a strong relationship. Better quality white wines have relatively lower density, and drier white wines usually have lower density as well. But sugar does not influence quality that much.

This plot shows that density and total.sulfur.dioxide have a moderate relationship. But total.sulfur.dioxide does not influence quality that much.

This plot shows that density and alcohol do have a strong relationship. Moreover, higher quality white wines tend to have lower density and a higher alcohol percentage. However, quality does not appear to be affected by sweetness.

## 
## Call:
## lm(formula = I(as.numeric(quality)) ~ alcohol + chlorides + citric.acid + 
##     density + fixed.acidity + free.sulfur.dioxide + pH + residual.sugar + 
##     sulphates + total.sulfur.dioxide + volatile.acidity, data = wq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8348 -0.4934 -0.0379  0.4637  3.1143 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.482e+02  1.880e+01   7.881 3.98e-15 ***
## alcohol               1.935e-01  2.422e-02   7.988 1.70e-15 ***
## chlorides            -2.473e-01  5.465e-01  -0.452  0.65097    
## citric.acid           2.209e-02  9.577e-02   0.231  0.81759    
## density              -1.503e+02  1.907e+01  -7.879 4.04e-15 ***
## fixed.acidity         6.552e-02  2.087e-02   3.139  0.00171 ** 
## free.sulfur.dioxide   3.733e-03  8.441e-04   4.422 9.99e-06 ***
## pH                    6.863e-01  1.054e-01   6.513 8.10e-11 ***
## residual.sugar        8.148e-02  7.527e-03  10.825  < 2e-16 ***
## sulphates             6.315e-01  1.004e-01   6.291 3.44e-10 ***
## total.sulfur.dioxide -2.857e-04  3.781e-04  -0.756  0.44979    
## volatile.acidity     -1.863e+00  1.138e-01 -16.373  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2803 
## F-statistic: 174.3 on 11 and 4886 DF,  p-value: < 2.2e-16

From this linear model, R-squared is just 0.2819, which means the fit of the model is not very good. Only around 28% of the variance in white wine quality is explained by these attributes. This is consistent with what we found before: the quality of white wines is not affected much by these attributes.
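
The adjusted R-squared and the F-statistic in the summary follow from R-squared, the number of observations and the number of predictors. The sketch below recomputes them from the printed values; because R-squared is rounded to four digits in the output, the results agree with the summary only up to rounding.

```r
# Values read from the model summary above.
r2 <- 0.2819   # multiple R-squared
n  <- 4898     # observations
p  <- 11       # predictors (excluding the intercept)

# Residual degrees of freedom, as printed in the summary (4886).
df_resid <- n - p - 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic on p and df_resid degrees of freedom.
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

round(c(adj_r2 = adj_r2, f_stat = f_stat), 4)
```

With almost 4900 observations and only 11 predictors, the adjustment is tiny, so the adjusted value stays close to 0.28.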

Secondly, the significance stars represent significance levels, with the number of asterisks determined by the p-value. Three stars mean high significance, while one star means low significance. In this case, alcohol, density, free.sulfur.dioxide, pH, residual.sugar, sulphates and volatile.acidity have three stars, indicating that it is unlikely that no relationship exists between quality and these variables.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

There is no strong relationship between the quality of white wines and these attributes in the dataset. Only density has a clear linear relationship with sugar and alcohol in white wines.

Were there any interesting or surprising interactions between features?

There are no surprising interactions between features in this dataset.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I built a linear model of quality on the other attributes of white wines. I found that quality is not strongly related to these attributes, which means the linear model does not predict quality very well from them.

The limitation of this model is that the data come from just one manufacturer, so we do not know how it would perform on white wines from other producers, and the model cannot represent them.


Final Plots and Summary

Plot One

Description One

From this plot, we can see that the distribution of white wine quality is almost normal. Most wines have quality 5, 6 or 7, and no wine is rated 0, 1, 2 or 10. I think this is because the grades are given by experts who taste the white wines, so the grades are subjective. Maybe the experts do not want to give very low grades or a perfect grade.

Plot Two

Description Two

In this scatterplot of density against residual.sugar, colored by quality, we can see that density increases gradually as residual.sugar goes up.

It also shows that, at the same value of residual.sugar, the colors for quality 8 and 9 are mostly at the bottom while the colors for low quality are above them. That may be because higher quality white wines have relatively lower density.

Furthermore, from the plot we can see that many points of each quality level are at the left side of the x axis. This indicates that most white wines have low residual.sugar no matter what quality they are.

Plot Three

Description Three

First, the stacked density plot by quality displays the quality distribution in density; the distribution is close to normal for each quality level.

Second, the scatterplot of density against alcohol, colored by quality, displays the relationship among these variables more clearly. We can see that density decreases gradually as the alcohol percentage goes up.

What is more, from the distribution of the colors, when the alcohol percentage is more than 12 there are very few white wines of quality 3 or quality 4.

Together, these plots may indicate that better quality white wines have a higher alcohol percentage and lower density.


Reflection

In summary, the quality of white wines is not strongly influenced by the other attributes; quality has little correlation with them. We can only speculate that higher quality white wines may have lower density and a higher alcohol percentage. Most wines have medium quality, with grades 5, 6 and 7.

A surprising thing is that density has a strong linear correlation with residual.sugar and alcohol in white wines. However, the correlations among the other attributes are mostly small, with most coefficients below 0.5 in absolute value.

When I was working on this dataset, I found that the challenge of exploratory data analysis is choosing which plot gives the clearest visualization and figuring out the relationships among many variables. Also, when we explore a specific dataset, we need some background knowledge of the data, so that we can interpret it better in its particular context.

For this white wine data, the limitation is that the data come from only one manufacturer. With more data from a variety of manufacturers, the analysis and conclusions about the attributes of white wines could be more confident and persuasive.
